Homework 1 Sample Solutions

DKU Stats 101 Fall 2024 Session 1

Author

Anonymous

Published

September 9, 2024

Part 1: One variable analysis

Q1: What kind of dataset do we have? (5 points)

  • According to the definitions in the textbook, describe the Five W’s for the following variables in the medallists.csv dataset.

    • medal_date
    • medal_type
    • medal_code
    • name
    • gender
    • country_code
    • discipline
    • event
    • birth_date
    • code_athlete
  • Categorical: name, gender, country_code, discipline, event
  • Ordinal: medal_type, medal_code
  • Identifier: name, code_athlete
  • Quantitative: medal_date, birth_date

Whether medal_date and birth_date are really quantitative is debatable. In statistical analysis, however, dates are often treated as quantitative by counting the number of days since a given starting date.
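
As a minimal sketch of the "days since a starting date" idea, the snippet below converts a few dates into day counts. The dates and the choice of the earliest date as the starting point are assumptions made for illustration; they are not taken from medallists.csv.

```r
# Hypothetical medal dates in ISO format (made up for illustration)
medal_date <- as.Date(c("2024-07-27", "2024-08-03", "2024-08-11"))

# Assumed reference point: the earliest date in the vector
start_date <- min(medal_date)

# Subtracting two Date objects gives a difference in days
days_since_start <- as.numeric(medal_date - start_date)
print(days_since_start)
```

Once expressed this way, the variable has units (days) and supports means, medians, and histograms like any other quantitative variable.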

Q2: Literature review (5 points)

Find a news article online (in either English or Chinese) that discusses some key results of the 2024 Paris Olympics in terms of athlete results and medals, particularly compared to previous Olympics. Make a list of at least three things we should expect or look for in the data.

Based on the article and your own personal expectations, what are some ways we might expect the data to be distributed or variables related? Make a list of at least three things we should expect or look for in the data and write a reason why we should expect it (no need to cite academic papers, just write down your reasons). Reasons should be thoughtful and at least two sentences explaining your logic for the expectation.

Points of emphasis:

  • The article must deal with the Olympics and expectations about athlete results. Logic about expectations must be coherent.

Q3: Describing the data (10 points)

Make a histogram of result for the 100 meter race competitors.

Figure 1: Results histogram

Describe it using the three features of quantitative data.

  • Shape: The distribution has a peak at around 10.2, but it is not smooth enough to be considered unimodal. It is right-skewed.

  • Center: The mean is 10.35, the median is 10.24 - both are quite close together.

  • Spread: The IQR is 0.46, so the middle half of the observations falls within a range of 0.46 seconds. The standard deviation is 0.45, which is about 100% of the IQR, because the standard deviation is affected by extreme outliers. This also indicates a distribution with a skew or outliers.

Does the histogram of result surprise you?

One surprising feature is that the histogram cuts off fairly abruptly on the left side. There are quite a few times at around 9.8, but nothing below 9.79. Presumably this is more or less the human limit of what is (currently) possible. Also, some of the runners are included multiple times, in multiple races, and they tend to run similar times, hence the aggregation.

Which is a better measure of center of the histogram, mean or median?

The median is usually better in a distribution that has outliers (such as this one), but in this case, both values are fairly close.

Make a nice table displaying the 5 number summary. Show your code in the document (echo: true).

library(dplyr)  # provides %>% and summarise()
library(knitr)  # provides kable()
kable(mens100m %>%
        summarise(min(result), quantile(result, probs = 0.25), median(result),
                  quantile(result, probs = 0.75), max(result)),
      col.names = c("Min", "25%", "Median", "75%", "Max"))
Table 1: Results 5 number summary
Min 25% Median 75% Max
9.79 10.06 10.24 10.52 12.11

There are quite a few other ways to generate this result, the above is just one example.
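
One such alternative, sketched here in base R, asks quantile() for all five values at once. The vector of times is hypothetical and only stands in for the result column of the real data.

```r
# Hypothetical 100 m times in seconds (made up; not the real results)
times <- c(9.8, 10.0, 10.1, 10.2, 10.3, 10.5, 10.9, 11.4, 12.0)

# Probabilities 0, 0.25, 0.5, 0.75 and 1 give min, Q1, median, Q3 and max
five_num <- quantile(times, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five_num) <- c("Min", "25%", "Median", "75%", "Max")
print(five_num)
```

The resulting named vector can be passed to kable() after wrapping it with t() so that the five values appear as one row.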

  • Calculate the standard deviation using the sd() function. Interpret it - is it large or small? How does it compare to the IQR? What does this tell you about the shape of the distribution?

The standard deviation (sd) measures how far each value typically is from the mean and so represents the spread of the distribution, which is why it is usually discussed together with the mean. The result of sd() equals the square root of the variance and has the same unit as the original data, but it can be greatly affected by outliers or skew. In this case, the standard deviation is 0.45, which is quite similar to the IQR (0.46). This means that there are no powerful outliers, since they would have a greater effect on the standard deviation than on the IQR.
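
The two properties mentioned above (sd() is the square root of the variance, and it reacts more strongly to outliers than the IQR does) can be illustrated with a small sketch. The values are hypothetical and chosen only to make the effect visible.

```r
# Hypothetical times in seconds, first without and then with one extreme value
x <- c(9.9, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5)
x_out <- c(x, 12.1)

# sd() is the square root of var()
sd_matches_var <- isTRUE(all.equal(sd(x), sqrt(var(x))))

# Relative growth of each spread measure after adding the extreme value
sd_growth <- sd(x_out) / sd(x)
iqr_growth <- IQR(x_out) / IQR(x)
print(c(sd_growth = sd_growth, iqr_growth = iqr_growth))
```

In this made-up example the standard deviation grows by a much larger factor than the IQR, which is the sensitivity the paragraph describes.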

Would this histogram benefit from a transformation, in your opinion? Why or why not? If it would, please transform it appropriately, make a new histogram, and describe the transformation.

Not really, but if we really wanted to make one, a log transformation (appropriate because of the tail) would look like this. The fact that it looks similar to the plot above shows that it wasn’t really needed.

Figure 2: Re-expression of results
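
A rough numerical check of why the log transformation changes so little, using only the minimum, median, and maximum from Table 1. The "tail ratio" below is an ad hoc measure introduced for this sketch, not a statistic from the textbook.

```r
# Values copied from Table 1 (seconds)
min_t <- 9.79
med_t <- 10.24
max_t <- 12.11

# Length of the upper tail relative to the lower tail, before and after log10
ratio_raw <- (max_t - med_t) / (med_t - min_t)
ratio_log <- (log10(max_t) - log10(med_t)) / (log10(med_t) - log10(min_t))
print(c(raw = ratio_raw, log10 = ratio_log))
```

Because all times lie in a narrow range (roughly 10 to 12 seconds), the logarithm is nearly linear over that range and the upper tail stays several times longer than the lower one.
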
  • Make a boxplot chart comparing the median of result according to a new variable stage_simple. If you previously transformed your data, keep it transformed for this step.
    • For this question, you will need to do a little bit of data manipulation. You will need to convert the variable stage into a new variable stage_simple using the mutate() verb you learned in DataCamp. Your goal is to simplify the variable stage into only 4 categories: Prelim, Round 1, Semis and Finals. One possible way to accomplish this is with the case_when() auxiliary function as demonstrated in this link.
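
A minimal sketch of the recoding step with mutate() and case_when(). The stage labels in the example data frame are hypothetical; the real file may spell them differently, so the patterns passed to grepl() would need to be adjusted.

```r
library(dplyr)

# Hypothetical stage labels (assumed for illustration only)
races <- data.frame(
  stage = c("Preliminary Round - Heat 1", "Round 1 - Heat 3", "Semifinal 2", "Final")
)

races <- races %>%
  mutate(
    # Conditions are checked in order, so "Semi" is tested before "Final"
    stage_simple = case_when(
      grepl("Prelim", stage) ~ "Prelim",
      grepl("Round 1", stage) ~ "Round 1",
      grepl("Semi", stage) ~ "Semis",
      grepl("Final", stage) ~ "Finals"
    ),
    # Fix the level order so boxplots appear in tournament order
    stage_simple = factor(stage_simple, levels = c("Prelim", "Round 1", "Semis", "Finals"))
  )
print(races)
```

Storing stage_simple as a factor with explicit levels is one way to get the boxplots in tournament order rather than alphabetical order.
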
Figure 3: Distribution of results based on stage

Interpret this graph, particularly with respect to your previous histogram of the overall distribution - what new information does this boxplot display uncover?

This graph shows that the distribution of performance changes as the tournament progresses towards the finals. In the preliminaries, there is a large amount of variance, and even the best performers don’t match the level of performance shown in the last stages. This trend continues towards the finals, where the IQR becomes very small – at this point, only the best athletes are left, and they’re all quite competitive with one another.

Points of emphasis:

  • Well labeled graphs, with appropriate (not variable name) names for the x and y axes.
  • Appropriate order of boxplots (e.g. preliminaries come before round 1, etc.)
  • Legend labeled
  • Graphs that contain the correct amount of information
  • Reasonable, thoughtful interpretations of the requested statistics, not just one or two word answers.
  • Correct results for the requested statistics

Q4: Comparing categorical variables (10 points)

One interesting piece of information the organizing committee would like to know is how the top five medal winning countries (defined by total medals of gold + silver + bronze) fared in the individual vs. team events. Make a contingency table of those five countries by team vs. individual medals.

Table 2: Contingency table of medals for team and individual events
Individual Team
Australia 40 13
China 73 18
France 47 17
Japan 36 9
United States 94 32

We can see from this table that all top five countries won more medals in individual events than in team events. Sorting the countries by individual medals or by team medals gives the same order. China is much closer to France and Australia in team medals than in individual medals, where it has far more.

Add margins to your table. Does it change your interpretation?

Table 3: Contingency table w/margins of medals for team and individual events
Individual Team Sum
Australia 40 13 53
China 73 18 91
France 47 17 64
Japan 36 9 45
United States 94 32 126
Sum 290 89 379

Adding margins to the table allows us to see the total count for each row and column. We can see that these countries won 290 medals from individual events, compared to 89 from team events. We also see the total medal count by country, but we already knew that, so it is not worth repeating here. The grand total (379) is also displayed.

Now convert your table into a proportions table. Does this better help explain what the data show?

Table 4: Proportions table of medals for team and individual events
Individual Team
Australia 0.75 0.25
China 0.80 0.20
France 0.73 0.27
Japan 0.80 0.20
United States 0.75 0.25

Interpret your table. What does this table indicate to you? Are you surprised by it? Why do you think you see the results you see here? What other information would be useful to understand why you see these results?

Around three quarters of the medals won by the top 5 countries (290 of 379) are from individual events. China and Japan have the highest proportion of their medals from individual events, France the lowest. We can only speculate as to why, but one possibility is that team sports are more popular in France.
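
The margins and proportions in Tables 3 and 4 can be reproduced from the counts in Table 2. The sketch below types the counts in by hand; the object names are assumptions, and the original solution may have built the tables directly from the data.

```r
# Counts copied from Table 2 (rows: countries, columns: event type)
medals <- as.table(matrix(
  c(40, 13, 73, 18, 47, 17, 36, 9, 94, 32),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Country = c("Australia", "China", "France", "Japan", "United States"),
    Event = c("Individual", "Team")
  )
))

# Table 3: add row and column totals
with_margins <- addmargins(medals)

# Table 4: proportions within each country (each row sums to 1)
row_props <- round(prop.table(medals, margin = 1), 2)

print(with_margins)
print(row_props)
```

Using margin = 1 conditions on country, which is what makes the rows comparable even though the countries won very different numbers of medals.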

Points of emphasis:

  • Reasonable, thoughtful interpretations of the requested statistics, not just one or two word answers.
  • Correct results for the requested statistics

Q5: Understanding and comparing distributions (5 points)

Another sport of interest to the organizing committee is diving. Using the five number summaries, calculate if result has any outliers according to the rule described in the textbook for outliers in boxplots. Show your calculations. Do you believe the outliers identified are real outliers? Why or why not? Consider the purpose of your report when preparing your answer.

diving <- read.csv("olympic data/results/Diving.csv")

kable(diving %>%
        summarise(min(result),  
                  quantile(result, probs=0.25), 
                  median(result), 
                  quantile(result, probs=0.75), 
                  max(result)), 
      col.names = c("Min", "25%", "Median", "75%", "Max"))
Table 5: Diving results (points) 5 number summary
Min 25% Median 75% Max
188.5 287.3775 346.975 410.3125 547.5

Boxplot calculations for result:

result_med <- median(diving$result)
result_lq <- quantile(diving$result, probs=0.25)
result_uq <- quantile(diving$result, probs=0.75)
result_iqr <- IQR(diving$result)
result_uf <- result_uq + 1.5*result_iqr
result_lf <- result_lq - 1.5*result_iqr
  • \(median=346.975\)
  • \(IQR=410.3125-287.3775=122.935\)
  • \(Upper\,fence=410.3125+1.5\cdot122.935=594.715\)
  • \(Lower\,fence=287.3775-1.5\cdot122.935=102.975\)
  • There are no values beyond the fences - so result does not have any outliers.
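
The conclusion in the last bullet can be checked directly from the five-number summary in Table 5: if the minimum and maximum lie inside the fences, no observation can lie outside them. A short sketch, with the values typed in from the table:

```r
# Values copied from Table 5
obs_min <- 188.5
q1 <- 287.3775
q3 <- 410.3125
obs_max <- 547.5

# Fences at 1.5 IQRs beyond the quartiles
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE if at least one observation falls beyond a fence
has_outliers <- obs_min < lower_fence || obs_max > upper_fence
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(has_outliers)
```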

Create two graphs of boxplots, one of result by stage and one of result by event_name. What can you conclude from these displays? Does it change your answer about outliers in the first part of the question?

Figure 4: Distribution of diving results (points) by stage

We can see that divers generally did worse in the preliminary stage. Semifinals and finals on the other hand are fairly similar. This makes sense - at this point, only the best divers are left in the tournament, and they are likely to be close together in performance.

Figure 5: Distribution of diving results (points) by event name

This boxplot reveals that there are large differences in the number of points awarded in the men’s events, compared to the women’s. For men, the median points awarded are around 400 for all four types. For women, the medians are around 300. Furthermore, for men, the synchronised 3m springboard has by far the lowest median, whereas for women, it is in line with the others. Moreover, 400 would be an outlier for women, but fairly usual for men. Similarly, the outliers at the lower end of the scale for men would be within the IQR for women.

Points of emphasis:

  • Boxplots well labeled
  • Proper calculation of 5 number summaries
  • Shows work for calculation of outliers
  • Makes a reasonable interpretation of the boxplot

Q6: The Normal distribution (10 points)

According to Our World in Data, the average human male height is 178.4 cm with a standard deviation of 7.59 cm, and the average female height is 164.7 cm with a standard deviation of 7.07 cm.

In the formal notation introduced in the textbook, write the Normal model of human height for males and females.

\(N(178.4, 7.59)\)

\(N(164.7, 7.07)\)

Filter the dataset for athletes in the athletics events. Find the mean height for men and women as well as the standard deviations. Is this close to the global averages? Why do you think the results either match or do not match the global averages?

Table 6: Athlete height - summary statistics
Gender Mean Standard deviation
Female 169.37 7.99
Male 182.24 8.23

The results are a little above global averages, which seems reasonable. Taller people are often at an advantage in many athletic disciplines, so it is plausible that Olympic athletes would be a little taller than the global population. The standard deviations are also a fair bit greater. Also, not all countries send the same number of athletes to the Olympics, and height is unevenly distributed between countries.
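
To put "a little above" on a common scale, the athlete means can be expressed as z-scores under the global Normal models written above. This is only a rough illustration using the rounded values from Table 6:

\(z_{male}=\frac{182.24-178.4}{7.59}\approx 0.51\)

\(z_{female}=\frac{169.37-164.7}{7.07}\approx 0.66\)

So the average athletics competitor is roughly half to two thirds of a global standard deviation taller than the global average for the same gender.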

Select five athletes at random from the athletics category and calculate their \(z\) score relative to the mean and standard deviation you found in the previous part (show both your code and calculations here).

set.seed(123)
random_5_athletes <- athletes_athletics %>%
  slice_sample(n=5)

# Note: this code assumes that at least 1 male and 1 female athlete was sampled
z_scores_male <- (random_5_athletes$height[random_5_athletes$gender == "Male"] - height_athletics$height_mean[height_athletics$gender == "Male"])/height_athletics$height_sd[height_athletics$gender == "Male"]
z_scores_female <- (random_5_athletes$height[random_5_athletes$gender == "Female"] - height_athletics$height_mean[height_athletics$gender == "Female"])/height_athletics$height_sd[height_athletics$gender == "Female"]

print(z_scores_male)
[1]  0.09177499 -0.27281550  0.57789565
print(z_scores_female)
[1] 0.5795689 0.3291602
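
An alternative sketch that does not rely on the sample containing both genders: each athlete's group mean and standard deviation are looked up with match(), so the z-scores are computed in one step. The means and standard deviations are the rounded values from Table 6; the five heights and the object names are hypothetical.

```r
# Means and standard deviations copied from Table 6
height_stats <- data.frame(
  gender = c("Female", "Male"),
  height_mean = c(169.37, 182.24),
  height_sd = c(7.99, 8.23)
)

# Hypothetical sample of five athletes (heights in cm are made up)
sampled <- data.frame(
  gender = c("Male", "Female", "Male", "Male", "Female"),
  height = c(183, 174, 180, 187, 172)
)

# Find the row of group statistics that belongs to each athlete
idx <- match(sampled$gender, height_stats$gender)

# z = (value - group mean) / group standard deviation
sampled$z <- (sampled$height - height_stats$height_mean[idx]) / height_stats$height_sd[idx]
print(sampled)
```

The same lookup works unchanged if the random sample happens to contain athletes of only one gender.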

Make a density plot (an example of this type of display is here) of the heights of male and female athletes in the athletics events. Do you think it is justified to model these athletes' heights as being normally distributed? Why or why not?

Figure 6: Density plot of height by gender

The nearly normal condition appears to be met - both distributions are unimodal, (mostly) symmetric, and have no extreme outliers.

Part 2: Two variable analysis

Q7: Relationship between variables (15 points)

Make a scatterplot of Total as a function of Athletes. Add a linear smoother to the plot and label any points you consider to be an outlier using geom_text() - the label for the outlier should print the observation’s country_code. If necessary, transform any variables.

Figure 7: Athletes vs. medals

There aren’t really any extreme outliers, but for the sake of this question, let’s consider all countries with more than 200 athletes or 25 medals to be outliers, given that the relationship appears weaker for those below that.

  1. Do you think there is a clear pattern? Describe the association between Athletes and Total.

There appears to be a positive linear relationship.

  • Direction - Positive
  • Form - Mostly linear. Countries that send more athletes to the Olympics win more medals.
  • Strength - Medium; the pattern holds for most countries, but there are some that under/overperform relative to the number of athletes sent. Also, the trend is weaker (but still exists) for the countries not labeled in the plot.
  • Outliers - USA, China, Great Britain and South Korea overperformed; many other countries, such as Spain, Germany, and Poland, underperformed.

Note that if we transform both variables, there are effectively no outliers:

Figure 8: Log athletes vs. log medals, no outliers
  2. Find out the details of any outliers you have identified. Do you think the outlier(s) should be excluded from the analysis? Why or why not?
Table 7: Athletes vs. medals - outliers
Country Athletes Total
United States 619 126
China 398 91
Japan 431 45
Australia 475 53
France 600 64
Netherlands 289 34
Great Britain 342 65
Korea 147 32
Italy 397 40
Germany 457 33
New Zealand 208 20
Canada 332 27
Spain 401 18
Brazil 290 20
Poland 226 10
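
The labelling rule used for this table (more than 200 athletes or more than 25 medals) can be checked against its rows. The sketch below also adds medals per 100 athletes as a crude indicator of over- or underperformance; that ratio is an assumption introduced here, and it is not the same as the distance from the fitted line in the scatterplot.

```r
# Rows copied from Table 7
labelled <- data.frame(
  country = c("United States", "China", "Japan", "Australia", "France",
              "Netherlands", "Great Britain", "Korea", "Italy", "Germany",
              "New Zealand", "Canada", "Spain", "Brazil", "Poland"),
  athletes = c(619, 398, 431, 475, 600, 289, 342, 147, 397, 457, 208, 332, 401, 290, 226),
  total = c(126, 91, 45, 53, 64, 34, 65, 32, 40, 33, 20, 27, 18, 20, 10)
)

# Labelling rule: more than 200 athletes OR more than 25 medals
labelled$flagged <- labelled$athletes > 200 | labelled$total > 25

# Crude efficiency measure (not the regression residual)
labelled$medals_per_100 <- round(100 * labelled$total / labelled$athletes, 1)

print(labelled[order(-labelled$medals_per_100), ])
```

Korea is the only row that qualifies through its medal count alone, since it sent fewer than 200 athletes.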

Make a second graph excluding any outliers you have identified and believe should be excluded (keep the variables transformed if you had them previously transformed)

Figure 9: Athletes vs. medals - no outliers
  3. What do you estimate the correlation to be, without using technology?

Any reasonable guess is ok here.

  4. Check the conditions for correlation
  • Quantitative variables condition: both are quantitative
  • Straight enough condition: the relationship is more or less straight
  • No outliers condition: there are a few outliers that cannot be excluded, though they probably will not change the estimate much.
  5. Find and interpret the correlation coefficient for this relationship

0.65 is a fairly strong correlation - indicative of the fact that countries that send more athletes to the Olympics also tend to win more medals.
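
For reference, the correlation coefficient can be written in terms of z-scores. This is the standard definition; the symbols \(z_x\), \(z_y\), and \(n\) are notation assumed here rather than taken from the original:

\(r=\frac{\sum z_x z_y}{n-1}\)

Here \(z_x\) and \(z_y\) are the standardised values of Athletes and Total for each country and \(n\) is the number of countries, so \(r\) has no units and always lies between -1 and 1.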

  6. Interpret this graph.

For countries that send more than about 150 athletes, a linear model appears to work less well. Transforming both variables, however, leads to a pretty good linear model.

  7. Now, make a third graph but display the points separately for gold, silver, and bronze medals. Add a linear smoother for each set of points. (You can set up your graph for the relationship with bronze medals, save it, and then add each other medal as a new layer to your saved graph).
Figure 10: Athletes vs. medals by medal type

How does this graphical display change the interpretation you developed in your answer to part 6? Why do you think the relationship is structured like this? Explain.

The relationship doesn’t really depend on the medal type. The intercept is a little higher for Bronze, and the slope a little less steep for Silver, but that’s about it.

Q8: Putting it all together (15 points)

Through the analysis conducted in the previous section and through at least one additional investigation of your own (which can be an additional graph or table, that analyzes a different relationship or distribution than one asked about in the questions above but you think is meaningful and important to communicate to the organizing committee), write three paragraphs outlining what you think are the main findings of questions 1-7 plus your additional investigation. What would you recommend to your organizing committee as to how to improve your country’s performance at the next Olympics? What are some important factors and relationships you discovered that you think they ought to pay attention to? What are some next steps and additional data that are needed to deepen this analysis?

  • Analysis here can vary but must be at least two paragraphs
  • Should accurately summarize the information discovered by answering the previous questions
  • B-level answer will conduct a shallow additional analysis, A-level answer will show interesting additional analysis that builds on previous answers
  • Shows a good understanding of the limits of this dataset
  • Should be as precise as possible, don’t use general statements when you can be more specific